Skip to content

ENH(string dtype): fallback for HDF5 with UTF-8 surrogates #60993

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 16, 2025

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 23, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

One oddity I encountered here: only the index is currently encoded / decoded on write / read operations respectively. Columns on the other hand are written and read as strings. I haven't looked into why this is, and if we can avoid encode/decode for index.

It seemed best to only fallback when errors="surrogatepass", though that use is a bit odd since there is no actual decode occurring. If there are other approaches (perhaps always falling back?), I'm certainly open to them.

Another option here is to fallback to object instead of StringDtype(storage="python"), but it seems with infer_string=True the latter is more appropriate.

@rhshadrach rhshadrach added IO HDF5 read_hdf, HDFStore Strings String extension data type and string data labels Feb 23, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Feb 23, 2025
@@ -5224,7 +5246,7 @@ def _convert_string_array(data: np.ndarray, encoding: str, errors: str) -> np.nd
# encode if needed
if len(data):
data = (
Series(data.ravel(), copy=False)
Series(data.ravel(), copy=False, dtype="object")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go immediately back to NumPy, so no reason to use string dtypes here.

Comment on lines +5289 to +5291
ser = Series(data, copy=False).str.decode(
encoding, errors=errors, dtype="object"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go immediately back to NumPy, so no reason to use string dtypes here.

and isinstance(values, np.ndarray)
and is_string_array(values, skipna=True)
):
result = result.astype(StringDtype(na_value=np.nan))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added back in #54431. I do not believe it is no longer necessary - we will infer string in the Series constructor.

@rhshadrach rhshadrach marked this pull request as ready for review February 23, 2025 15:41
@rhshadrach rhshadrach changed the title ENH(string dtype): object fallback for HDF5 with UTF-8 surrogates ENH(string dtype): fallback to Python storage for HDF5 with UTF-8 surrogates Feb 23, 2025
@rhshadrach rhshadrach changed the title ENH(string dtype): fallback to Python storage for HDF5 with UTF-8 surrogates ENH(string dtype): fallback for HDF5 with UTF-8 surrogates Feb 23, 2025
@jorisvandenbossche
Copy link
Member

Looks good to me!

Another option here is to fallback to object instead of StringDtype(storage="python"), but it seems with infer_string=True the latter is more appropriate.

I think there is also something to say for using object dtype here, to make it clearer to the user that something is going wrong and that it is not actually valid (UTF-8) strings.
But no strong opinion.

@rhshadrach
Copy link
Member Author

to make it clearer to the user that something is going wrong and that it is not actually valid (UTF-8) strings.

Would raising a warning be a better flag to the user?

@rhshadrach
Copy link
Member Author

rhshadrach commented Mar 22, 2025

Thinking about this more, I would prefer to not issue a warning. The user is already passing surrogatepass, they are aware of surrogate issues. We should not be noisy.

@rhshadrach
Copy link
Member Author

@jorisvandenbossche - friendly ping.

1 similar comment
@rhshadrach
Copy link
Member Author

@jorisvandenbossche - friendly ping.

@rhshadrach
Copy link
Member Author

@jorisvandenbossche @mroeschke - I think this is ready for merging.

@jorisvandenbossche jorisvandenbossche merged commit a393c31 into pandas-dev:main Apr 16, 2025
42 checks passed
Copy link

lumberbot-app bot commented Apr 16, 2025

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 a393c31931af4fc69ea0006fd851eeb00175be3c
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #60993: ENH(string dtype): fallback for HDF5 with UTF-8 surrogates'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60993-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60993 on branch 2.3.x (ENH(string dtype): fallback for HDF5 with UTF-8 surrogates)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Still Needs Manual Backport Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants